Package: MSstats
Author: Anshuman Raina
Date: 5th Semptember 2024
MSstats, an R package in Bioconductor, supports protein significance analysis for statistical relative quantification of proteins and peptides in global, targeted and data-independent proteomics. It handles shotgun, label-free and label-based (universal synthetic peptide-based) SRM (selected reaction monitoring), and SWATH/DIA (data independent acquisition) experiments. It can be used for experiments with complex designs (e.g. comparing more than two experimental conditions, or a time course).
This vignette summarizes the introduction and various options of all
functionalities in MSstats. More details are available in
User Manual.
To install this package, start R (version “4.0”) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MSstats")
library(MSstats)
To begin with, we will load sample datasets, including both annotated and plain data. The dataset you need can be found here.
We will also load the Annotation Dataset using MSstatsConvert. You can access this dataset here.
library(MSstats)
# Load data
pd_raw = system.file("tinytest/raw_data/PD/pd_input.csv",
package = "MSstatsConvert")
annotation_raw = system.file("tinytest/raw_data/PD/annot_pd.csv",
package = "MSstatsConvert")
pd = data.table::fread(pd_raw)
annotation = data.table::fread(annotation_raw)
head(pd, 5)
## Confidence.Level Search.ID Processing.Node.No Sequence Unique.Sequence.ID PSM.Ambiguity
## <char> <char> <int> <char> <int> <char>
## 1: High A 4 SLIASTLYR 1327 Unambiguous
## 2: High A 4 AYLATQGVEIR 2889 Unambiguous
## 3: High A 4 NHEIIGDIVPLAK 4700 Unambiguous
## 4: High A 4 NHEIIGDIVPLAK 4700 Unambiguous
## 5: High A 4 YHVNQYTGDESR 5209 Unambiguous
## Protein.Descriptions
## <char>
## 1: Uridine kinase OS=Escherichia coli (strain K12) GN=udk PE=3 SV=1 - [URK_ECOLI]
## 2: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
## 3: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
## 4: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
## 5: Imidazole glycerol phosphate synthase subunit HisF OS=Escherichia coli (strain K12) GN=hisF PE=1 SV=1 - [HIS6_ECOLI]
## X..Proteins X..Protein.Groups Protein.Group.Accessions Modifications Activation.Type DeltaScore DeltaCn
## <int> <int> <char> <char> <char> <int> <int>
## 1: 1 1 P0A8F4 CID 1 0
## 2: 1 1 P60664 CID 1 0
## 3: 1 1 P60664 CID 1 0
## 4: 1 1 P60664 CID 1 0
## 5: 1 1 P60664 CID 1 0
## Rank Search.Engine.Rank Precursor.Area QuanResultID Decoy.Peptides.Matched Exp.Value Homology.Threshold
## <int> <int> <num> <lgcl> <int> <num> <int>
## 1: 1 1 3.26e+07 NA NA 2.7e-01 13
## 2: 1 1 2.71e+08 NA NA 8.4e-05 13
## 3: 1 1 1.40e+08 NA NA 6.6e-03 13
## 4: 1 1 2.13e+08 NA NA 4.5e-04 13
## 5: 1 1 5.43e+06 NA NA 3.8e-02 13
## Identity.High Identity.Middle IonScore Peptides.Matched X..Missed.Cleavages Isolation.Interference....
## <int> <int> <int> <int> <int> <int>
## 1: 13 13 19 6 0 53
## 2: 13 13 54 9 0 25
## 3: 13 13 35 10 0 64
## 4: 13 13 46 10 0 50
## 5: 13 13 27 3 0 29
## Ion.Inject.Time..ms. Intensity Charge m.z..Da. MH...Da. Delta.Mass..Da. Delta.Mass..PPM. RT..min.
## <int> <num> <int> <num> <num> <int> <num> <num>
## 1: 3 1590000 2 512.2952 1023.583 0 -0.17 48.61
## 2: 0 17200000 2 610.8357 1220.664 0 0.67 45.31
## 3: 1 3100000 3 473.6051 1418.801 0 0.35 58.58
## 4: 3 2020000 2 709.9044 1418.802 0 0.92 58.53
## 5: 12 579000 2 734.8257 1468.644 0 -0.74 23.52
## First.Scan Last.Scan MS.Order Ions.Matched Matched.Ions Total.Ions
## <int> <int> <char> <char> <int> <int>
## 1: 14971 14971 MS2 Jun-74 6 74
## 2: 13599 13599 MS2 Sep-98 9 98
## 3: 19004 19004 MS2 8/128 8 128
## 4: 18981 18981 MS2 14/128 14 128
## 5: 4707 4707 MS2 8/112 8 112
## Spectrum.File Annotation
## <char> <lgcl>
## 1: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw NA
## 2: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw NA
## 3: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw NA
## 4: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw NA
## 5: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw NA
head(annotation, 5)
## Run Condition BioReplicate
## <char> <char> <int>
## 1: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw Condition1 1
## 2: 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2.raw Condition1 1
## 3: 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3.raw Condition1 1
## 4: 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1.raw Condition2 2
## 5: 121219_S_CCES_01_05_LysC_Try_1to10_Mixt_2_2.raw Condition2 2
We have the following converters, which allow you to convert various types of output reports which include the feature level data to the required input format of MSstats.
We show an example of how to use the above said Formatters.
input_data = read.csv("../data/diann_report.tsv",
sep="\t")
head(input_data)
## File.Name Run
## 1 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA.d 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA
## 2 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA.d 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA
## 3 230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA.d 230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA
## 4 230719_THP-1_Chrom_end2end_Plate1_Jakafi_A08_DIA.d 230719_THP-1_Chrom_end2end_Plate1_Jakafi_A08_DIA
## 5 230719_THP-1_Chrom_end2end_Plate1_DbET6_B08_DIA.d 230719_THP-1_Chrom_end2end_Plate1_DbET6_B08_DIA
## 6 230719_THP-1_Chrom_end2end_Plate1_DbET6_G10_DIA.d 230719_THP-1_Chrom_end2end_Plate1_DbET6_G10_DIA
## Protein.Group Protein.Ids Protein.Names Genes PG.Quantity PG.Normalised PG.MaxLFQ Genes.Quantity
## 1 Q9NVA2 Q9NVA2 SEP11_HUMAN SEPTIN11 79142.6 46870.4 49600.6 79142.6
## 2 Q9NVA2 Q9NVA2 SEP11_HUMAN SEPTIN11 97279.9 58082.6 55838.1 97279.9
## 3 Q9NVA2 Q9NVA2 SEP11_HUMAN SEPTIN11 75382.3 39671.6 46009.7 75382.3
## 4 Q9NVA2 Q9NVA2 SEP11_HUMAN SEPTIN11 99529.9 49235.4 52154.0 99529.9
## 5 Q9NVA2 Q9NVA2 SEP11_HUMAN SEPTIN11 88499.7 52884.9 56081.4 88499.7
## 6 Q9NVA2 Q9NVA2 SEP11_HUMAN SEPTIN11 74063.6 64431.5 61170.8 74063.6
## Genes.Normalised Genes.MaxLFQ Genes.MaxLFQ.Unique Modified.Sequence Stripped.Sequence
## 1 46870.4 49600.6 49600.6 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 2 58082.6 55838.1 55838.1 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 3 39671.6 46009.7 46009.7 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 4 49235.4 52154.0 52154.0 AAAQLLQSQ(UniMod:7)AQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 5 52884.9 56081.4 56081.4 AAAQLLQSQAQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## 6 64431.5 61170.8 61170.8 AAAQLLQSQAQQSGAQQTK AAAQLLQSQAQQSGAQQTK
## Precursor.Id Precursor.Charge Q.Value PEP Global.Q.Value Protein.Q.Value
## 1 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2 2 6.59541e-03 1.87628e-01 1.09371e-02 0.000293945
## 2 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2 2 5.09463e-03 1.58092e-01 1.09371e-02 0.000301841
## 3 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2 2 3.51206e-04 4.28005e-03 1.09371e-02 0.000279096
## 4 AAAQLLQSQ(UniMod:7)AQQSGAQQTK2 2 7.36836e-04 1.27557e-02 1.09371e-02 0.000293341
## 5 AAAQLLQSQAQQSGAQQTK2 2 6.73729e-03 1.99875e-01 4.46056e-03 0.000301750
## 6 AAAQLLQSQAQQSGAQQTK3 3 1.32103e-05 1.32103e-05 2.78130e-09 0.000298775
## PG.Q.Value Global.PG.Q.Value GG.Q.Value Translated.Q.Value Proteotypic Precursor.Quantity
## 1 0.000293169 0.000270051 0.000293945 0 1 132.008
## 2 0.000300842 0.000270051 0.000301841 0 1 110.006
## 3 0.000278319 0.000270051 0.000279096 0 1 765.040
## 4 0.000292312 0.000270051 0.000293341 0 1 331.017
## 5 0.000301023 0.000270051 0.000301750 0 1 328.017
## 6 0.000297619 0.000270051 0.000298775 0 1 11670.300
## Precursor.Normalised Precursor.Translated Translated.Quality Ms1.Translated Quantity.Quality RT
## 1 69.2365 115.112 NA 27216.6 0.816497 5.49752
## 2 65.1416 100.941 NA 18976.7 0.656910 5.57173
## 3 360.6170 653.304 NA 41557.7 0.691451 5.51056
## 4 161.4600 292.662 NA 42097.8 0.767069 5.47232
## 5 165.5430 274.266 NA 36324.9 0.350367 5.54726
## 6 13575.6000 15046.300 NA 23575.0 0.953474 5.44155
## RT.Start RT.Stop iRT Predicted.RT Predicted.iRT First.Protein.Description Lib.Q.Value Lib.PG.Q.Value
## 1 5.47311 5.52203 24.1531 5.51437 24.0711 Septin-11 0.000441969 0.000307314
## 2 5.54731 5.59616 24.1531 5.59037 24.0589 Septin-11 0.000441969 0.000307314
## 3 5.46164 5.53499 24.1531 5.52646 24.0707 Septin-11 0.000441969 0.000307314
## 4 5.43564 5.49676 24.1531 5.51338 23.9568 Septin-11 0.000441969 0.000307314
## 5 5.52279 5.57168 24.0491 5.53787 24.0952 Septin-11 0.000148948 0.000307314
## 6 5.39265 5.51484 24.0238 5.44345 24.0249 Septin-11 0.000000001 0.000307314
## Ms1.Profile.Corr Ms1.Area Evidence Spectrum.Similarity Averagine Mass.Evidence CScore Decoy.Evidence
## 1 0.551782 31211.4 1.61919 0.222315 1.0000000 0.000000 0.851078 0.000000
## 2 0.785901 20680.9 1.90409 0.117410 0.0987181 0.000000 0.873837 0.000000
## 3 0.766113 48665.3 2.08056 0.395749 0.0591408 0.000000 0.995963 0.000000
## 4 0.846188 47614.9 2.97196 0.533900 1.0000000 0.000000 0.988132 0.000000
## 5 0.665084 43443.8 1.53823 0.198184 0.8581840 0.000000 0.843245 0.000000
## 6 0.976121 18285.4 4.53526 0.438272 1.0000000 0.588394 0.999988 0.830825
## Decoy.CScore Fragment.Quant.Raw
## 1 -1.00000e+07 132.008;0;0;69.0042;0;0;0;0;0;131.008;0;0;
## 2 -1.00000e+07 67.0042;0;43.0021;153.008;0;0;0;0;0;0;0;0;
## 3 -1.00000e+07 386.021;131.006;248.013;206.008;0;88.0042;0;61.0021;0;0;0;0;
## 4 -1.00000e+07 0;167.008;164.008;258.011;229.01;0;0;166.006;0;0;0;0;
## 5 -1.00000e+07 144.008;171.008;0;184.008;83.0042;0;56.0021;0;0;0;0;0;
## 6 1.03618e-01 5303.15;2541.07;3826.1;2771.07;783.019;3142.07;1047.03;863.022;1214.03;850.02;401.011;0;
## Fragment.Quant.Corrected
## 1 132.008;0;0;69.0042;0;0;0;0;0;131.008;0;0;
## 2 67.0042;0;43.0021;153.008;0;0;0;0;0;0;0;0;
## 3 386.021;131.006;248.013;206.008;0;88.0042;0;61.0021;0;0;0;0;
## 4 0;167.008;164.008;258.011;229.01;0;0;166.006;0;0;0;0;
## 5 144.008;171.008;0;184.008;83.0042;0;56.0021;0;0;0;0;0;
## 6 5303.15;2541.07;3826.1;2771.07;783.019;3142.07;1047.03;863.022;1214.03;850.02;401.011;0;
## Fragment.Correlations
## 1 0.816497;0;0;0.816497;0;0;0;0;0;0.408248;0;0;
## 2 0.816497;0;0.408248;0.476696;0;0;0;0;0;0;0;0;
## 3 0.774108;0.674748;0.571623;0.500944;0;0.721572;0;0.500944;0;0;0;0;
## 4 0;0.595407;0.941872;0.832934;0.715275;0;0;0.732418;0;0;0;0;
## 5 0.276408;0.816497;0;0.408248;0.816497;0;0.408248;0;0;0;0;0;
## 6 0.968525;0.99425;0.905533;0.900774;0.861883;0.819064;0.804806;0.929188;0.75831;0.662669;0.872515;0;
## MS2.Scan IM iIM Predicted.IM Predicted.iIM
## 1 10775 1.183750 1.185000 1.179540 1.188150
## 2 10919 1.185000 1.185000 1.177780 1.190830
## 3 10799 1.185000 1.185000 1.179370 1.189810
## 4 10727 1.193520 1.185000 1.179870 1.197940
## 5 10871 1.192500 1.183860 1.179140 1.196260
## 6 10663 0.900937 0.897083 0.884704 0.912769
annotation_file = read.csv("../data/diann_annotation.csv")
head(annotation_file)
## Run Tx Well Unique.Peptides Plate Condition BioReplicate
## 1 230719_THP-1_Chrom_end2end_Plate1_DMSO_A02_DIA DMSO A02 45341 Plate1 DMSO 2
## 2 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA DMSO A05 44666 Plate1 DMSO 5
## 3 230719_THP-1_Chrom_end2end_Plate1_Jakafi_A08_DIA Jakafi A08 47029 Plate1 Jakafi 8
## 4 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA DMSO A10 46326 Plate1 DMSO 10
## 5 230719_THP-1_Chrom_end2end_Plate1_DMSO_A12_DIA DMSO A12 47363 Plate1 DMSO 12
## 6 230719_THP-1_Chrom_end2end_Plate1_DMSO_B01_DIA DMSO B01 46873 Plate1 DMSO 13
msstats_format = MSstatsConvert::DIANNtoMSstatsFormat(input_data,
annotation=annotation_file,
global_qvalue_cutoff = 0.01,
qvalue_cutoff = 0.01,
pg_qvalue_cutoff = 0.01,
useUniquePeptide = TRUE,
removeFewMeasurements = TRUE,
removeOxidationMpeptides = TRUE,
removeProtein_with1Feature = TRUE,
MBR=TRUE)
head(msstats_format)
## ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType
## 1 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK 2 Frag1 1 Light
## 2 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK 2 Frag3 1 Light
## 3 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK 2 Frag4 1 Light
## 4 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK 2 Frag1 1 Light
## 5 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK 2 Frag3 1 Light
## 6 SEP11_HUMAN AAAQLLQSQ(UniMod:7)AQQSGAQQTK 2 Frag4 1 Light
## Condition BioReplicate Run Fraction Intensity
## 1 DMSO 10 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA 1 132.0080
## 2 DMSO 10 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA 1 0.0000
## 3 DMSO 10 230719_THP-1_Chrom_end2end_Plate1_DMSO_A10_DIA 1 69.0042
## 4 DMSO 5 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA 1 67.0042
## 5 DMSO 5 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA 1 43.0021
## 6 DMSO 5 230719_THP-1_Chrom_end2end_Plate1_DMSO_A05_DIA 1 153.0080
The imported data from Step 1.1. now must be converted through MSStatsConvert package’s PDtoMSstatsFormat converter.
This function converts the Proteome discoverer output into the required input format for MSstats.
MSstatsConvert::PDtoMSstatsFormat(input,
annotation,
useNumProteinsColumn=FALSE,
useUniquePeptide=TRUE,
summaryforMultipleRows=max,
fewMeasurements="remove",
removeOxidationMpeptides=FALSE,
removeProtein_with1Peptide=FALSE,
which.quantification = 'Precursor.Area',
which.proteinid = 'Protein.Group.Accessions',
which.sequence = 'Sequence')
Actual Data modification can be seen below:
pd_imported = MSstatsConvert::PDtoMSstatsFormat(pd, annotation, use_log_file = FALSE)
## INFO [2024-09-08 20:05:03] ** Raw data from ProteomeDiscoverer imported successfully.
## INFO [2024-09-08 20:05:03] ** Raw data from ProteomeDiscoverer cleaned successfully.
## INFO [2024-09-08 20:05:03] ** Using provided annotation.
## INFO [2024-09-08 20:05:03] ** Run labels were standardized to remove symbols such as '.' or '%'.
## INFO [2024-09-08 20:05:03] ** The following options are used:
## - Features will be defined by the columns: PeptideSequence, PrecursorCharge
## - Shared peptides will be removed.
## - Proteins with single feature will not be removed.
## - Features with less than 3 measurements across runs will be removed.
## INFO [2024-09-08 20:05:03] ** Features with all missing measurements across runs are removed.
## INFO [2024-09-08 20:05:03] ** Shared peptides are removed.
## INFO [2024-09-08 20:05:03] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
## INFO [2024-09-08 20:05:03] ** Features with one or two measurements across runs are removed.
## INFO [2024-09-08 20:05:03] ** Run annotation merged with quantification data.
## INFO [2024-09-08 20:05:03] ** Features with one or two measurements across runs are removed.
## INFO [2024-09-08 20:05:03] ** Fractionation handled.
## INFO [2024-09-08 20:05:03] ** Updated quantification data to make balanced design. Missing values are marked by NA
## INFO [2024-09-08 20:05:03] ** Finished preprocessing. The dataset is ready to be processed by the dataProcess function.
head(pd_imported)
## ProteinName PeptideModifiedSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Condition
## 1 P0ABU9 ANSHAPEAVVEGASR 2 NA NA L Condition1
## 2 P0ABU9 ANSHAPEAVVEGASR 2 NA NA L Condition1
## 3 P0ABU9 ANSHAPEAVVEGASR 2 NA NA L Condition1
## 4 P0ABU9 ANSHAPEAVVEGASR 2 NA NA L Condition2
## 5 P0ABU9 ANSHAPEAVVEGASR 2 NA NA L Condition2
## 6 P0ABU9 ANSHAPEAVVEGASR 2 NA NA L Condition2
## BioReplicate Run Fraction Intensity
## 1 1 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw 1 21400000
## 2 1 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw 1 17500000
## 3 1 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw 1 NA
## 4 2 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw 1 11600000
## 5 2 121219_S_CCES_01_05_LysC_Try_1to10_Mixt_2_2raw 1 12000000
## 6 2 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw 1 16200000
Once we import the dataset correctly with Formatter, we need to pre-process the data which is taken care by dataProcess.
This step is essentially Data pre-processing and quality control of MS runs of the original raw data into quantitative data for model fitting and group comparison.
There are 3 main steps that happen in background :
Normalization - There are three different normalizations supported. ‘equalizeMedians’ (default) represents constant normalization (equalizing the medians) based on reference signals is performed. ‘quantile’ represents quantile normalization based on reference signals is performed. ‘globalStandards’ represents normalization with global standards proteins. FALSE represents no normalization is performed.
Feature selection - This also has three options i.e. Select All features, Top-N features (by mean intensity) or “Best” features.
Missing value imputation - We impute plausible values in case of missing data points. The RunLevelData can be queried to show Number of imputed intensities (censored intensities) in a RUN and Protein.
summarized = dataProcess(pd_imported)
## INFO [2024-09-08 20:05:03] ** Log2 intensities under cutoff = 23.053 were considered as censored missing values.
## INFO [2024-09-08 20:05:03] ** Log2 intensities = NA were considered as censored missing values.
## INFO [2024-09-08 20:05:03] ** Use all features that the dataset originally has.
## INFO [2024-09-08 20:05:03]
## # proteins: 5
## # peptides per protein: 1-16
## # features per peptide: 1-1
## INFO [2024-09-08 20:05:03] Some proteins have only one feature:
## P00363,
## P0A8J2 ...
## INFO [2024-09-08 20:05:03]
## Condition1 Condition2 Condition3 Condition4 Condition5
## # runs 3 3 3 3 3
## # bioreplicates 1 1 1 1 1
## # tech. replicates 3 3 3 3 3
## INFO [2024-09-08 20:05:03] Some features are completely missing in at least one condition:
## LDEGcTERC5(Carbamidomethyl)_2_NA_NA,
## ELREQVGDEHIGVIPEDcYYKC18(Carbamidomethyl)_3_NA_NA,
## TNYDHPSAMDHSLLLEHLQALK_3_NA_NA,
## LARPGSDVALDDQLYQEPQAAPVAVPMGK_3_NA_NA,
## AYLATQGVEIR_2_NA_NA ...
## INFO [2024-09-08 20:05:03] == Start the summarization per subplot...
## | | | 0% | |==================== | 20% | |======================================== | 40% | |============================================================ | 60% | |================================================================================ | 80% | |====================================================================================================| 100%
## INFO [2024-09-08 20:05:04] == Summarization is done.
head(summarized$FeatureLevelData)
## PROTEIN PEPTIDE TRANSITION FEATURE LABEL GROUP RUN SUBJECT FRACTION
## 1 P0ABU9 ANSHAPEAVVEGASR_2 NA_NA ANSHAPEAVVEGASR_2_NA_NA L Condition1 1 1 1
## 2 P0ABU9 ANSHAPEAVVEGASR_2 NA_NA ANSHAPEAVVEGASR_2_NA_NA L Condition1 2 1 1
## 3 P0ABU9 ANSHAPEAVVEGASR_2 NA_NA ANSHAPEAVVEGASR_2_NA_NA L Condition1 3 1 1
## 4 P0ABU9 ANSHAPEAVVEGASR_2 NA_NA ANSHAPEAVVEGASR_2_NA_NA L Condition2 4 2 1
## 5 P0ABU9 ANSHAPEAVVEGASR_2 NA_NA ANSHAPEAVVEGASR_2_NA_NA L Condition2 5 2 1
## 6 P0ABU9 ANSHAPEAVVEGASR_2 NA_NA ANSHAPEAVVEGASR_2_NA_NA L Condition2 6 2 1
## originalRUN censored INTENSITY ABUNDANCE newABUNDANCE predicted
## 1 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw FALSE 21400000 23.71945 23.71945 NA
## 2 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw FALSE 17500000 24.06085 24.06085 NA
## 3 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw TRUE NA NA 22.77604 22.77604
## 4 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw FALSE 11600000 23.77304 23.77304 NA
## 5 121219_S_CCES_01_05_LysC_Try_1to10_Mixt_2_2raw TRUE 12000000 23.00805 22.95207 22.95207
## 6 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw FALSE 16200000 23.74312 23.74312 NA
head(summarized$ProteinLevelData)
## RUN Protein LogIntensities originalRUN GROUP SUBJECT
## 1 1 P0A8F4 22.96185 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw Condition1 1
## 2 2 P0A8F4 23.27048 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw Condition1 1
## 3 3 P0A8F4 23.44357 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw Condition1 1
## 4 4 P0A8F4 23.31217 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw Condition2 2
## 5 6 P0A8F4 23.87516 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw Condition2 2
## 6 8 P0A8F4 24.31958 121219_S_CCES_01_08_LysC_Try_1to10_Mixt_3_2raw Condition3 3
## TotalGroupMeasurements NumMeasuredFeature MissingPercentage more50missing NumImputedFeature
## 1 9 1 0.6666667 TRUE 2
## 2 9 1 0.6666667 TRUE 2
## 3 9 1 0.6666667 TRUE 2
## 4 9 1 0.6666667 TRUE 2
## 5 9 1 0.6666667 TRUE 2
## 6 9 2 0.3333333 FALSE 1
head(summarized$SummaryMethod)
## [1] "TMP"
After dataProcessing the input data, MSstats provides multiple plots to analyze the pre-processed data. Here we show the various types of plots we can use. Once invoked, a pdf file will be downloaded with corresponding feature level data and the Plot generated.
# Profile plot
dataProcessPlots(data=summarized, type="ProfilePlot")
# Quality control plot
dataProcessPlots(data=summarized, type="QCPlot")
# Quantification plot for conditions
dataProcessPlots(data=summarized, type="ConditionPlot")
In this step we test for significant changes in protein abundance across conditions based on a family of linear mixed-effects models in targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment.
It is applicable to multiple types of sample preparation, including label-free workflows, workflows that use stable isotope labeled reference proteins and peptides, and workflows that use fractionation. Experimental design of case-control study (patients are not repeatedly measured) or time course study (patients are repeatedly measured) is automatically determined based on proper statistical model.
# Model
labels = unique(summarized$ProteinLevelData$GROUP)
contrast_matrix = MSstatsContrastMatrix('pairwise', labels)
model = groupComparison(contrast_matrix, summarized)
## INFO [2024-09-08 20:05:09] == Start to test and get inference in whole plot ...
## | | | 0% | |========================= | 25% | |================================================== | 50% | |=========================================================================== | 75% | |====================================================================================================| 100%
## INFO [2024-09-08 20:05:09] == Comparisons for all proteins are done.
Model Details
head(model$ModelQC)
## RUN Protein ABUNDANCE originalRUN GROUP SUBJECT
## 1 1 P0A8F4 22.96185 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1raw Condition1 1
## 2 2 P0A8F4 23.27048 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2raw Condition1 1
## 3 3 P0A8F4 23.44357 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3raw Condition1 1
## 4 4 P0A8F4 23.31217 121219_S_CCES_01_04_LysC_Try_1to10_Mixt_2_1raw Condition2 2
## 5 6 P0A8F4 23.87516 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3raw Condition2 2
## 6 8 P0A8F4 24.31958 121219_S_CCES_01_08_LysC_Try_1to10_Mixt_3_2raw Condition3 3
## TotalGroupMeasurements NumMeasuredFeature MissingPercentage more50missing NumImputedFeature residuals
## 1 9 1 0.6666667 TRUE 2 -0.26344817
## 2 9 1 0.6666667 TRUE 2 0.04517815
## 3 9 1 0.6666667 TRUE 2 0.21827003
## 4 9 1 0.6666667 TRUE 2 -0.28149357
## 5 9 1 0.6666667 TRUE 2 0.28149357
## 6 9 2 0.3333333 FALSE 1 0.66038185
## fitted
## 1 23.22530
## 2 23.22530
## 3 23.22530
## 4 23.59366
## 5 23.59366
## 6 23.65919
Visualization for model-based analysis and summarizing differentially abundant proteins. To summarize the results of log-fold changes and adjusted p-values for differentially abundant proteins, groupComparisonPlots takes testing results from function groupComparison as input and automatically generate three types of figures in pdf files as output :
Volcano plot : For each comparison separately. It illustrates actual log-fold changes and adjusted p-values for each comparison separately with all proteins. The x-axis is the log fold change. The base of logarithm transformation is the same as specified in “logTrans” from dataProcess. The y-axis is the negative log2 or log10 adjusted p-values. The horizontal dashed line represents the FDR cutoff. The points below the FDR cutoff line are non-significantly abundant proteins (colored in black). The points above the FDR cutoff line are significantly abundant proteins (colored in red/blue for up-/down-regulated). If fold change cutoff is specified (FCcutoff = specific value), the points above the FDR cutoff line but within the FC cutoff line are non-significantly abundant proteins (colored in black).
Heatmap : For multiple comparisons. It illustratea up-/down-regulated proteins for multiple comparisons with all proteins. Each column represents each comparison of interest. Each row represents each protein. Color red/blue represents proteins in that specific comparison are significantly up-regulated/down-regulated proteins with FDR cutoff and/or FC cutoff. The color scheme shows the evidences of significance. The darker color it is, the stronger evidence of significance it has. Color gold represents proteins are not significantly different in abundance.
Comparison plot : For multiple comparisons per protein. It illustrates log-fold change and its variation of multiple comparisons for single protein. X-axis is comparison of interest. Y-axis is the log fold change. The red points are the estimated log fold change from the model. The error bars are the confidence interval with 0.95 significant level for log fold change. This interval is only based on the standard error, which is estimated from the model.
groupComparisonPlots(
model$ComparisonResult,
type="Heatmap",
sig = 0.05,
FCcutoff = FALSE,
logBase.pvalue = 10,
ylimUp = FALSE,
ylimDown = FALSE,
xlimUp = FALSE,
x.axis.size = 10,
y.axis.size = 10,
dot.size = 3,
text.size = 4,
text.angle = 0,
legend.size = 13,
ProteinName = TRUE,
colorkey = TRUE,
numProtein = 100,
clustering = "both",
width = 800,
height = 600,
which.Comparison = "all",
which.Protein = "all",
address = FALSE,
isPlotly = TRUE
)
Calculate sample size for future experiments of a Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment based on intensity-based linear model. The function fits the model and uses variance components to calculate sample size. The underlying model fitting with intensity-based linear model with technical MS run replication. Estimated sample size is rounded to 0 decimal. Two options of the calculation:
sample_size_calc = designSampleSize(model$FittedModel,
desiredFC=c(1.75,2.5),
power = TRUE,
numSample=5)
To illustrate the relationship of desired fold change and the calculated minimal number sample size which are
Number of biological replicates per condition power. The input is the result from function designSampleSize.
designSampleSizePlots(sample_size_calc, isPlotly=FALSE)
Model-based quantification for each condition or for each biological samples per protein in a targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment. Quantification takes the processed data set by dataProcess as input and automatically generate the quantification results (data.frame) with long or matrix format. The quantification for endogenous samples is based on run summarization from subplot model, with TMP robust estimation.
Sample quantification : individual biological sample quantification for each protein. The label of each biological sample is a combination of the corresponding group and the sample ID. If there are no technical replicates or experimental replicates per sample, sample quantification is the same as run summarization from dataProcess (xxx$RunlevelData from dataProcess). If there are technical replicates or experimental replicates, sample quantification is median among run quantification corresponding MS runs.
Group quantification : quantification for individual group or individual condition per protein. It is median among sample quantification.
sample_quant_long = quantification(summarized,
type = "Sample",
format = "long")
sample_quant_long
## Protein Group_Subject LogIntensity
## <fctr> <fctr> <num>
## 1: P0A8F4 Condition1_1 23.27048
## 2: P0A8J2 Condition1_1 25.41377
## 3: P0ABU9 Condition1_1 23.94076
## 4: P60664 Condition1_1 26.95914
## 5: P0A8F4 Condition2_2 23.59366
## 6: P0A8J2 Condition2_2 25.37768
## 7: P0ABU9 Condition2_2 24.35179
## 8: P60664 Condition2_2 27.36184
## 9: P0A8F4 Condition3_3 23.65919
## 10: P0A8J2 Condition3_3 24.84218
## 11: P0ABU9 Condition3_3 23.97927
## 12: P60664 Condition3_3 26.89201
## 13: P0A8F4 Condition4_4 24.04638
## 14: P0A8J2 Condition4_4 NA
## 15: P0ABU9 Condition4_4 24.96019
## 16: P60664 Condition4_4 27.69317
## 17: P0A8F4 Condition5_5 24.50374
## 18: P0A8J2 Condition5_5 NA
## 19: P0ABU9 Condition5_5 25.42248
## 20: P60664 Condition5_5 27.98325
## Protein Group_Subject LogIntensity
sample_quant_wide = quantification(summarized,
type = "Sample",
format = "matrix")
sample_quant_wide
## Key: <Protein>
## Protein Condition1_1 Condition2_2 Condition3_3 Condition4_4 Condition5_5
## <fctr> <num> <num> <num> <num> <num>
## 1: P0A8F4 23.27048 23.59366 23.65919 24.04638 24.50374
## 2: P0A8J2 25.41377 25.37768 24.84218 NA NA
## 3: P0ABU9 23.94076 24.35179 23.97927 24.96019 25.42248
## 4: P60664 26.95914 27.36184 26.89201 27.69317 27.98325
group_quant_long = quantification(summarized,
type = "Group",
format = "long")
group_quant_long
## Protein Group LogIntensity
## <fctr> <fctr> <num>
## 1: P0A8F4 Condition1 23.27048
## 2: P0A8J2 Condition1 25.41377
## 3: P0ABU9 Condition1 23.94076
## 4: P60664 Condition1 26.95914
## 5: P0A8F4 Condition2 23.59366
## 6: P0A8J2 Condition2 25.37768
## 7: P0ABU9 Condition2 24.35179
## 8: P60664 Condition2 27.36184
## 9: P0A8F4 Condition3 23.65919
## 10: P0A8J2 Condition3 24.84218
## 11: P0ABU9 Condition3 23.97927
## 12: P60664 Condition3 26.89201
## 13: P0A8F4 Condition4 24.04638
## 14: P0A8J2 Condition4 NA
## 15: P0ABU9 Condition4 24.96019
## 16: P60664 Condition4 27.69317
## 17: P0A8F4 Condition5 24.50374
## 18: P0A8J2 Condition5 NA
## 19: P0ABU9 Condition5 25.42248
## 20: P60664 Condition5 27.98325
## Protein Group LogIntensity
group_quant_wide = quantification(summarized,
type = "Group",
format = "matrix")
group_quant_wide
## Key: <Protein>
## Protein Condition1 Condition2 Condition3 Condition4 Condition5
## <fctr> <num> <num> <num> <num> <num>
## 1: P0A8F4 23.27048 23.59366 23.65919 24.04638 24.50374
## 2: P0A8J2 25.41377 25.37768 24.84218 NA NA
## 3: P0ABU9 23.94076 24.35179 23.97927 24.96019 25.42248
## 4: P60664 26.95914 27.36184 26.89201 27.69317 27.98325